Data Analysis

Week 6: Generalised Linear Models part 1

Jafet Belmont

School of Mathematics and Statistics

ILOs

By the end of this session you will be able to:

  • Fit a logistic regression model with either a numerical or categorical explanatory variable

  • Interpret the regression coefficients of the logistic model in terms of their effects on both the odds and the log odds (of the response)

  • Conduct model validation and assess model predictive performance

Overview of today’s session

Logistic regression

This week we will learn how to model outcomes of interest that take one of two categorical values (e.g. yes/no, success/failure, alive/dead), i.e.

  • binary, taking the value 1 (say success, with probability \(p_i\)) or 0 (failure, with probability \(1-p_i\))
  • We are interested in \(\text{Prob}(Event) = p_i\). Let \(y_i\) denote the random variable for the event of interest.

In this case,

\[ y_i \sim \mathrm{Bin}(1,p_i), \qquad g(p_i) = \log \left(\frac{p_i}{1 - p_i} \right) = \mathbf{x}_i^\top \boldsymbol{\beta}, \]

where the link function \(g\) is also referred to as the log-odds (since \(p_i / (1-p_i)\) is an odds). Inverting the link function gives

\[p_i = \frac{\exp\left(\mathbf{x}_i^\top \boldsymbol{\beta}\right)}{1 + \exp\left(\mathbf{x}_i^\top \boldsymbol{\beta}\right)} ~~~ \in [0, 1].\]
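In R, the logit link and its inverse are available as base functions; a quick sketch (the value 0.8 is just an illustrative probability):

```r
# qlogis() is the logit (log-odds) function; plogis() is its inverse
p <- 0.8
log_odds <- qlogis(p)          # log(p / (1 - p))
p_back   <- plogis(log_odds)   # exp(x) / (1 + exp(x)), recovers p
log_odds
p_back
```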

Required R packages

Before we proceed, load all the packages needed for this week:

library(tidyverse)
library(ggplot2)
library(sjPlot)
library(broom)
library(performance)
library(yardstick) 
  1. The first two libraries, tidyverse and ggplot2, allow us to manipulate data and create nice data visualisations.
  2. The next two libraries, sjPlot and broom, allow us to summarise and present model output in a tidy format.
  3. The final two libraries, performance and yardstick, are used for model assessment and validation.

First example - Teaching evaluation scores

The evals data set contains student evaluations for a sample of 463 courses taught by 94 professors from the University of Texas at Austin.

Code
evals.gender <- evals %>%
                  select(gender, age)

Fitting a logistic regression in R

\[ \log\left( \frac{p}{1-p} \right) = \alpha + \beta \, \mathrm{age} \]

where \(p = \textrm{Prob}\left(\textrm{Male}\right)\).
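A sketch of how this model can be fitted with glm(), assuming the evals data come from the moderndive package and evals.gender was created as above (the object name model is ours):

```r
library(moderndive)  # assumed source of the evals data set

# Binomial family with the default logit link
model <- glm(gender ~ age,
             data = evals.gender,
             family = binomial)
summary(model)
```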

  gender
Predictors    Log-Odds (95% CI)        p
(Intercept)   -2.70 (-3.72 – -1.71)    <0.001
age           0.06 (0.04 – 0.08)       <0.001
  • the log-odds of the instructor being male increase by 0.06 for every one unit increase in age.
  gender
Predictors    Odds Ratios (95% CI)     p
(Intercept)   0.07 (0.02 – 0.18)       <0.001
age           1.06 (1.04 – 1.09)       <0.001
  • for every 1 unit increase in age, the odds of the teaching instructor being male increase by a factor of 1.06.

Relationship between Odds and Probabilities

Table 1: Relationship between Odds and Probabilities
Scale Equivalence
Odds \[ \mathrm{Odds} = \exp(\mathrm{logOdds}) = \dfrac{P(\mathrm{event})}{1-P(\mathrm{event})} \]
Probability \[ P(\mathrm{event}) = \dfrac{\exp(\mathrm{logOdds})}{1+\exp(\mathrm{logOdds})} = \dfrac{\mathrm{Odds}}{1+\mathrm{Odds}} \]
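As a quick numerical check of Table 1, using the fitted coefficients from the age model (intercept -2.70, slope 0.06; age 50 is just an illustrative value):

```r
log_odds <- -2.70 + 0.06 * 50          # linear predictor for a 50-year-old
odds <- exp(log_odds)                  # odds scale
p    <- odds / (1 + odds)              # probability scale (same as plogis(log_odds))
c(log_odds = log_odds, odds = odds, p = p)
```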

Model evaluation

library(performance)
check_model(model, check = c("pp_check","binned_residuals","outliers","qq"))

Predictive performance metrics

How well does our model predict new observations?

  • compute the predicted classes and compare them against the observed values.

  • We typically classify these probabilities into discrete classes based on a threshold (commonly 0.5 for binary classification).

Code
pred_results <- model %>% 
  # augment() returns fitted probabilities when type.predict = "response"
  broom::augment(type.predict = "response") %>%
  mutate(predicted_class = 
           factor(ifelse(.fitted > 0.5, "male", "female")))
gender  age  .fitted  .resid  .hat  .sigma  .cooksd  .std.resid  predicted_class
female  36   0.39     -1.00   0.01  1.13    0        -1.00       female
female  36   0.39     -1.00   0.01  1.13    0        -1.00       female
female  36   0.39     -1.00   0.01  1.13    0        -1.00       female
female  36   0.39     -1.00   0.01  1.13    0        -1.00       female
male    59   0.73     0.79    0.00  1.13    0        0.79        male

Predictive performance metrics

We can use these predicted classes to compute different predictive performance/evaluation metrics

  • The correct classification rate (CCR) or accuracy describes the overall proportion of teaching instructors (males or females) that were classified correctly.

  • The true positive rate (TPR) or sensitivity (a.k.a. recall), denotes the proportion of actual male instructors that are correctly classified as males by the model.

  • The true negative rate (TNR) or specificity, denotes the proportion of actual females that have been classified correctly as females by the model.

  • The model’s precision or positive predictive value (PPV) represents the proportion of predicted male instructors that were actually male

  • The model’s negative predictive value (NPV) represents the proportion of predicted female instructors that were actually female.
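These metrics can be computed with yardstick; a sketch, assuming pred_results from the previous slide and treating "male" as the positive class (yardstick requires truth and estimate to be factors with the same levels):

```r
library(yardstick)

# Confusion matrix of observed vs predicted classes
conf_mat(pred_results, truth = gender, estimate = predicted_class)

# Accuracy (CCR), sensitivity (TPR), specificity (TNR) and precision (PPV);
# event_level = "second" picks "male" (the second factor level) as positive
accuracy(pred_results, truth = gender, estimate = predicted_class)
sens(pred_results, truth = gender, estimate = predicted_class, event_level = "second")
spec(pred_results, truth = gender, estimate = predicted_class, event_level = "second")
ppv(pred_results,  truth = gender, estimate = predicted_class, event_level = "second")
```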

ROC Curve

  • Plots the true positive rate (sensitivity) against the false positive rate (1 - specificity) at various threshold levels.

  • The closer the ROC curve is to the top-left corner, the better the model is at distinguishing between the positive and negative classes
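A sketch of how the ROC curve and the area under it can be obtained with yardstick, again assuming pred_results from earlier and "male" as the positive class:

```r
library(yardstick)

# ROC curve over all thresholds, using the fitted probabilities;
# event_level = "second" treats "male" (the second factor level) as positive
roc_df <- roc_curve(pred_results, truth = gender, .fitted,
                    event_level = "second")
autoplot(roc_df)

# Area under the ROC curve: the closer to 1, the better the discrimination
roc_auc(pred_results, truth = gender, .fitted, event_level = "second")
```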

Logistic regression with one categorical explanatory variable

Instead of having a numerical explanatory variable such as age, let’s now use the binary categorical variable ethnicity as our explanatory variable.

Fitting the model

evals.ethnic <- evals %>%
                  select(gender, ethnicity)
model.ethnic <- glm(gender ~ ethnicity,
                    data = evals.ethnic,
                    family = binomial) 
  gender
Predictors                  Log-Odds (95% CI)       p
(Intercept)                 -0.25 (-0.75 – 0.24)    0.319
ethnicity [not minority]    0.66 (0.13 – 1.20)      0.015

Interpretation of model parameters

Let's break this down. The model we have fitted is:

\[ \mathrm{log}\left(\dfrac{p_i}{1-p_i}\right) = \alpha + \beta_{\text{ethnicity}} \times \mathbb{I}_{\mathrm{ethnicity}}(\mathrm{not~ minority}) \]

  • \(\alpha\) is the intercept, representing the log-odds when \(\mathbb{I}_{\mathrm{ethnicity}}(\mathrm{not~ minority}) = 0\).

    • When the instructor belongs to the reference category minority the model simplifies to: \[\mathrm{log}\left(\frac{p_i}{1-p_i}\right) = \alpha = -0.25 \]
    • \(\text{Odds}(\text{male=1|minority=1}) = \left[\dfrac{P(\mathrm{male}=1 |\mathrm{minority}=1)}{P(\mathrm{female}= 1 |\mathrm{minority}=1)}\right]=\exp(\alpha) = \exp(-0.25) = 0.78\)
    • \(\text{Odds}(\text{male=0|minority=1}) = \left[\dfrac{P(\mathrm{male}=1 |\mathrm{minority}=1)}{P(\mathrm{female}= 1 |\mathrm{minority}=1)}\right]^{-1} = \exp(\alpha)^{-1} = \exp(-0.25)^{-1} = 1.28\)

  • \(\beta_{\mathrm{ethnicity}}\) is the coefficient for the predictor \(\mathbb{I}_{\mathrm{ethnicity}}(\mathrm{not~ minority})\), which shows how the log-odds change when moving from the reference category (minority) to the other level (not minority).

    • When the instructor does not belong to the reference category, i.e. \(\mathbb{I}_{\mathrm{ethnicity}}(\mathrm{not~ minority}) = 1\), the model becomes: \[\mathrm{log}\left(\dfrac{p_i}{1-p_i}\right) = \alpha + \beta_{\text{ethnicity}}\]

So, the log-odds of an instructor being male in the not minority group are \(\alpha +\beta_{\text{ethnicity}}\).

On the odds scale this is:

\[ \text{Odds}(\text{male=1|minority=0}) = \exp(\alpha + \beta_{\text{ethnicity}} ) = \exp(-0.25 + 0.66) \approx 1.51 \]

Odds ratio

The odds of an instructor being male in the not minority group compared to the minority ethnic group is given by the following odds ratio:

\[ \begin{aligned} \frac{\mathrm{Odds}(\mathrm{male} = 1| \mathrm{minority} = 0)}{\mathrm{Odds}(\mathrm{male} = 1| \mathrm{minority} = 1)} &= \dfrac{\frac{p_{(\mathrm{minority} = 0)}}{1- p_{(\mathrm{minority} = 0)}}}{\frac{p_{(\mathrm{minority}=1)}}{1- p_{(\mathrm{minority}=1)}}} \\ &= \frac{\mathrm{exp}( \alpha + \beta_{\text{ethnicity}})}{\exp\left(\alpha\right)}\\ &= \exp\left(\alpha + \beta_{\text{ethnicity}} - \alpha\right) \\ &= \exp\left(\beta_{\text{ethnicity}}\right) = 1.94 \end{aligned} \]

This means that the odds of being male are 1.94 times higher for instructors not in the minority group than for instructors in the minority group, a statistically significant difference (p = 0.015).
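The odds ratio can be checked in R by exponentiating the estimated coefficient from the table (on the fitted object, exp(coef(model.ethnic)) gives the same result for both parameters):

```r
beta_ethnicity <- 0.66   # estimated log-odds difference from the model table
exp(beta_ethnicity)      # odds ratio, approximately 1.94

# On the fitted model object this is simply:
# exp(coef(model.ethnic))
```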

The steps ahead

  • Calculate, using R, the logistic regression coefficients on the odds and probability scales when either a continuous or a categorical explanatory variable is used in the model.

  • Correctly interpret the model parameters of a fitted logistic regression (in terms of the log-odds, odds and probabilities) with either a continuous or a categorical explanatory variable.

  • Check how to visualise the results of a fitted logistic regression with either a continuous or a categorical explanatory variable.

  • Interpret the model diagnostic plots and predictive performance metrics of a logistic regression model.